RANDOMIZATION OF SYMBOL REPETITION ON PUNCH CARDS WITH
SUPERIMPOSED CODING IN INFORMATION-SEARCH SYSTEMS

L. Ya. Pirovich

The article shows how the irregular use of individual symbols
affects search noise on punch cards with superimposed symbol
coding in information-search systems (IPS).

A binomial distribution law for the random value of repetition
of each symbol is established and analyzed. A method of
determining the maximum value of symbol repetition is proposed,
and an example of calculating this value for an experimental
IPS is given.

The use of superimposed symbol coding on cards with edge perforations,
and of slotted and machine punch cards, in mechanized information-search
systems (IPS) is connected with the appearance during search of superfluous
punch cards that do not correspond to the search request [1-3]. The codes
are superimposed because several symbols are entered on a single code field
of the punch card. The irregularity with which individual symbols are
encountered in different documents has a substantial effect on the search
noise¹ of such a system. Single symbols or groups of symbols are encountered
more often than others, both during indexing and during search.

It is known [4, 5] that the code configurations of often repeated symbols
decrease the selectivity of codes that contain code symbols common to them².

As an illustration, let us give an example from paper [5]. In that IPS the
symbols

strengthening by TVCh (high-frequency currents)
hardness HRC 45-50
diameter of the information slot, 40 to 50 mm

often fall together on a single punch card.

Let  us  write  their  code  configurations: 

65-44-1^1

72-1^-14 

^-36-13 

The code symbols of these configurations form many new codes. One of
them is marked by squares. Retrieving punch cards by this code
configuration will obviously yield a very large number of superfluous
cards.
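The mechanism can be sketched in a few lines of Python. The code
configurations below are invented for illustration and are not those of
paper [5]; the names `superimpose` and `matches` are hypothetical. The code
field of a card is taken as the union of the code symbols of all symbols
entered on it, and a card is delivered when the field contains every code
symbol of the request.

```python
# Hypothetical code configurations: symbol -> set of punched positions.
CODES = {
    "A": {1, 4, 7},
    "B": {2, 4, 9},
    "C": {3, 7, 8},
    "X": {1, 2, 8},  # never entered on the card below
    "Y": {1, 5, 9},  # never entered on the card below
}

def superimpose(symbols):
    """Code field of a card: union of the code symbols of its symbols."""
    field = set()
    for s in symbols:
        field |= CODES[s]
    return field

def matches(field, symbol):
    """A card is delivered if every code symbol of the request is punched."""
    return CODES[symbol] <= field

card = superimpose(["A", "B", "C"])
false_drop = matches(card, "X")   # X is covered by positions of A, B and C
no_drop = matches(card, "Y")      # position 5 is not punched
```

Here the card answers a request for X although X was never entered on it,
which is exactly a superfluous card in the sense used above.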

The presence of often repeated symbols and their entry onto a single
code field of the punch card increases the average search noise of the
system and, in isolated cases, yields an especially large number of
superfluous punch cards.

In  order  to  decrease  search  noise,  it  is  necessary  to  randomize  symbol 
repetition  of  the  IPS,  to  exclude  often  repeated  symbols  from  the  total  code 
field  of  a  punch  card,  and  to  enter  them  on  specially  isolated  fields. 

In connection with this, documentation specialists working with an IPS
will be interested in the acceptable level of symbol repetition, above which
the search noise of the system increases and exceeds established limits.
Paper [5] describes an IPS that provides for the exclusion of symbols from
the overall code field of the punch card as their repetition rate increases.
In this case, too, it is very important to know the acceptable extent of
symbol repetition, upon reaching which a symbol must be treated as often
repeated.

When setting up the main part of the punch cards of the IPS, the
information specialist analyzes each document in the IPS storage and assigns
to it a group of characteristic symbols (descriptors) available in the
dictionary (the register) of the system. The set of selected symbols is the
search image of the document and is entered on the punch card.

All documents entering the IPS storage are processed in this way. The
appearance of one or another symbol is a random event; therefore, we shall
approach the investigation of such events from the standpoint of probability
theory.

Let us assume that there are n symbols in the IPS dictionary, that an
average of r symbols is entered onto each punch card, and that the main part
of the punch cards consists of Q cards.

Isolating the total of random symbols in a document can then be
represented, using standard models of probability theory, as the random
selection of r symbols from a general total of n symbols. A selection of
volume r is a non-recurring selection, because only r different symbols can
be entered onto the punch card and no symbol can be entered twice or more³.
Moreover, the order of the symbols in this selection makes no difference.

The number of such different selections from a general aggregate of n
symbols is equal to the number of combinations of n symbols taken r at a
time, $C_n^r$.

Let us agree to consider any of the random selections of volume r equally
possible, each appearing with a probability equal to $1/C_n^r$.

Let us select at random one of the n symbols and investigate the law
of its repetition during compilation of the main part of Q cards.

The probability that the symbol being studied will fall in any one of the
selections of volume r and, consequently, on any punch card is equal to

$$p = \frac{C_{n-1}^{r-1}}{C_n^r}.$$

Solving this equation, we obtain

$$p = \frac{r}{n}. \tag{1}$$


A series of Q independent Bernoulli trials is performed, each of which
is a random selection of volume r from the overall total of n symbols. The
probability that a symbol will appear in the selection is equal to p; the
probability that it will not appear is equal to q = 1 - p.
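The value of p can be verified by direct counting. In the sketch below
(the function name `card_probability` is ours) the selections of volume r
that contain a fixed symbol are counted as C(n-1, r-1), divided by the
total number C(n, r), and compared with r/n. The parameters n = 139 and
r = 6 are those of the experimental IPS discussed later in the article.

```python
from fractions import Fraction
from math import comb

def card_probability(n: int, r: int) -> Fraction:
    """Probability that a fixed symbol is among r distinct symbols drawn from n."""
    favourable = comb(n - 1, r - 1)  # selections that contain the fixed symbol
    total = comb(n, r)               # all equally possible selections
    return Fraction(favourable, total)

p = card_probability(139, 6)
q = 1 - p                            # probability that the symbol is absent
agrees_with_r_over_n = (p == Fraction(6, 139))
```

Exact fractions are used so that the equality with r/n is checked without
rounding error.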

Let us determine the probability that the symbol being studied will
appear K times in Q independent trials. Let us denote this probability by
$P_{K,Q}$ and write [7]:

$$P_{K,Q} = C_Q^K \, p^K q^{Q-K}. \tag{2}$$


Substituting the values of p and q in equation (2), we obtain

$$P_{K,Q} = C_Q^K \left(\frac{r}{n}\right)^K \left(1 - \frac{r}{n}\right)^{Q-K}. \tag{3}$$


The number K in the series of Q trials is random and may assume the
following whole-number values:

0, 1, 2, ..., Q-1, Q

with the corresponding probabilities

$P_{0,Q},\ P_{1,Q},\ P_{2,Q},\ \ldots,\ P_{Q-1,Q},\ P_{Q,Q}$.

Equation (3) determines the binomial probability distribution of the random
value K.
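As a minimal numerical sketch of equation (3), the probabilities can be
tabulated for every K from 0 to Q. The function name
`repetition_probability` is ours; the parameters Q = 177, n = 139 and r = 6
are those of the experimental IPS quoted later in the article.

```python
from math import comb

def repetition_probability(K: int, Q: int, n: int, r: int) -> float:
    """P_{K,Q}: probability that a symbol appears on exactly K of Q cards."""
    p = r / n
    return comb(Q, K) * p**K * (1 - p) ** (Q - K)

Q, n, r = 177, 139, 6
distribution = [repetition_probability(K, Q, n, r) for K in range(Q + 1)]
total = sum(distribution)  # the probabilities of all values of K add up to 1
most_likely_K = max(range(Q + 1), key=distribution.__getitem__)
```

The sum over all K serves as a simple check that the tabulated values form
a probability distribution.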

According to [7], the mathematical expectancy (average value) and the
average quadratic deviation (the degree of scattering of the values of K
about the mathematical expectancy) of the random value K are

$$M(K) = Qp, \tag{4}$$

$$\sigma(K) = \sqrt{Qpq}. \tag{5}$$

Substituting the values of p and q in equations (4) and (5), we obtain

$$M(K) = Q\,\frac{r}{n}, \tag{6}$$

$$\sigma(K) = \sqrt{Q\,\frac{r}{n}\left(1 - \frac{r}{n}\right)}. \tag{7}$$
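Equations (6) and (7) translate directly into code. The following is only
a sketch; the function names `expectancy` and `deviation` are ours, and the
parameters are those of the experimental IPS quoted later in the article.

```python
from math import sqrt

def expectancy(Q: int, n: int, r: int) -> float:
    """Mathematical expectancy M(K) = Q * r / n, equation (6)."""
    return Q * r / n

def deviation(Q: int, n: int, r: int) -> float:
    """Average quadratic deviation sqrt(Q * (r/n) * (1 - r/n)), equation (7)."""
    p = r / n
    return sqrt(Q * p * (1 - p))

# Q = 177 punch cards, n = 139 symbols, r = 6 symbols per card.
M = expectancy(177, 139, 6)
sigma = deviation(177, 139, 6)
```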

On the condition of equal probability of the random selections of volume
r, we have hypothetically established the binomial distribution law of the
discrete random value of repetition of any arbitrarily selected symbol of
the IPS, and we have determined its mathematical expectancy and average
quadratic deviation.

The IPS for sketches of mechanical components [5] was used for a
statistical check of the hypothesis of the binomial distribution law of the
random value of symbol repetition.

A sub-group of symbols, "lengths of components or surfaces in mm,"
containing 45 symbols, was isolated from the IPS. These symbols are almost
equally likely to be encountered in the documents.

Further,  punch  cards  were  selected  on  which  symbols  from  this  sub-group 
were  found.  The  real  value  of  repetition  of  each  of  the  45  symbols  was 
computed. 


As a result of statistical processing of the data obtained [8], a polygon
of distribution of the discrete random value of symbol repetition was plotted
(Fig. 1).

[Fig. 1. Polygon of distribution of the discrete random value of symbol
repetition.]

The values of symbol repetition K were plotted on the axis of abscissae;
the corresponding empirical frequencies of the random value K, obtained in
90 experiments, and the theoretical probabilities, computed by formula (3),
were plotted on the axis of ordinates.

The  thin  broken  line  in  Fig.  1  is  the  graph  of  the  empirical  frequency 
distribution  of  the  random  value  of  K.  The  thick  broken  line  is  the  graph 
of  the  distribution  of  theoretical  probabilities  of  the  random  value  of  K. 

The probability that the empirical and theoretical distribution graphs
agree, according to Pearson's χ² goodness-of-fit criterion with four degrees
of freedom, is equal to 0.486. This probability is quite large; therefore,
the hypothesis that the random value of symbol repetition is distributed
according to a binomial law may be considered plausible.
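A check of this kind can be sketched as follows. The observed and expected
counts below are invented for illustration and are not the data of the
experiment, and the helper names are ours. For exactly four degrees of
freedom the tail probability of the χ² distribution has the closed form
exp(-x/2)(1 + x/2), so no statistical library is needed.

```python
from math import exp

def chi_square_statistic(observed, expected):
    """Pearson's statistic: sum of (O - E)^2 / E over all classes."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def tail_probability_4df(x: float) -> float:
    """P(chi^2 >= x) for exactly four degrees of freedom (closed form)."""
    return exp(-x / 2) * (1 + x / 2)

# Hypothetical class counts (five classes, 90 observations in total) and
# the counts a theoretical law would predict for the same classes.
observed = [10, 22, 27, 19, 12]
expected = [12.0, 20.0, 26.0, 20.0, 12.0]

x = chi_square_statistic(observed, expected)
agreement = tail_probability_4df(x)  # a large value supports the hypothesis
```

Five classes with no parameter estimated from the data give the four
degrees of freedom assumed by the closed form.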

According to the de Moivre-Laplace limit theorem [9], as the number of
independent Bernoulli trials increases, the binomial distribution
asymptotically approaches the normal distribution having the same
mathematical expectancy and average quadratic deviation.
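The convergence can be observed numerically. The sketch below compares the
binomial probabilities with the normal density having the same expectancy
and deviation and records the largest gap for a small and a large number of
trials. The function names and the trial counts 200 and 20000 are our
choices for illustration; logarithms (`math.lgamma`) are used so that large
Q does not overflow.

```python
from math import exp, lgamma, log, pi, sqrt

def binomial_pmf(K: int, Q: int, p: float) -> float:
    """Binomial probability computed through logarithms to avoid overflow."""
    log_c = lgamma(Q + 1) - lgamma(K + 1) - lgamma(Q - K + 1)
    return exp(log_c + K * log(p) + (Q - K) * log(1 - p))

def normal_density(x: float, mean: float, sd: float) -> float:
    """Density of the normal law with the given mean and deviation."""
    return exp(-((x - mean) ** 2) / (2 * sd * sd)) / (sd * sqrt(2 * pi))

def largest_gap(Q: int, p: float) -> float:
    """Largest absolute difference between the two laws over K = 0..Q."""
    mean, sd = Q * p, sqrt(Q * p * (1 - p))
    return max(abs(binomial_pmf(K, Q, p) - normal_density(K, mean, sd))
               for K in range(Q + 1))

p = 6 / 139
gap_small = largest_gap(200, p)
gap_large = largest_gap(20000, p)  # noticeably smaller than gap_small
```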

In our case, the number of independent trials is equal to the number of
punch cards Q in the main part of the IPS. Usually, the main part of
hand-sorted punch cards includes from several hundred to several thousand
cards. The main part of machine-sorted punch cards may consist of several
tens of thousands of cards (we recommend up to 100,000).

For a given probability p that the symbol being studied will be entered
on a punch card, we can determine the minimum number of punch cards Q_min
in the IPS that permits the approximate use of the normal distribution
instead of the binomial. To do this, we use the inequality given in
paper [10]:

$$Q_{\min} \ge \frac{9}{p(1-p)}.$$

In the interval 0 < p < 0.1, Q_min is almost inversely proportional to p.

As an illustration of the given inequality, let us consider an
experimental IPS [5].

The IPS parameters are the following: n = 139 symbols and r = 6 symbols.
According to formula (1),

$$p = \frac{6}{139} \approx 0.043.$$

Hence,

$$Q_{\min} \ge \frac{9}{0.043 \cdot 0.957} \approx 220.$$
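The estimate can be reproduced numerically. The sketch assumes the bound
in the form Q >= 9 / (p(1 - p)); the constant 9 is treated as an
assumption and is exposed as a parameter, and the function name
`minimum_cards` is ours.

```python
def minimum_cards(p: float, bound: float = 9.0) -> float:
    """Smallest Q with Q * p * (1 - p) >= bound (bound = 9 is assumed)."""
    return bound / (p * (1 - p))

p = 6 / 139                 # formula (1) with n = 139, r = 6
q_min = minimum_cards(p)    # a little under 220 cards

# For small p the product q_min * p stays close to the bound, that is,
# q_min is almost inversely proportional to p.
ratio = minimum_cards(0.01) * 0.01
```

With the unrounded p = 6/139 the result is slightly below the rounded
figure of about 220 cards.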

Obviously, an approximately normal distribution of the random value of
symbol repetition is suitable for most IPS.

This circumstance makes it possible to use the well-studied normal
distribution law, which has been described in detail in the literature, for
analysis of the extent of symbol repetition.

According to the normal distribution law [11], 99.73% of all values of
the random value K, that is, practically all values of the repetition of
any symbol of the IPS, must lie in the range from M(K) - 3σ(K) to
M(K) + 3σ(K). An increase in the repetition of individual symbols increases
the search noise of the system. Therefore, we are interested in the upper
acceptable limit of symbol repetition, at which the average search noise of
the system will not exceed the level established for the given IPS. The
normal distribution law provides an answer to this problem and permits us
to establish the quantitative character of the distribution of symbol
repetition values.

Thus, 13.59% of the repetition values of each symbol should lie in the
range from M(K) + σ(K) to M(K) + 2σ(K), and only 2.14% in the range from
M(K) + 2σ(K) to M(K) + 3σ(K) (Fig. 2).
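These percentages follow from the standard normal distribution function
and can be recomputed with the error function of the Python standard
library. The helper name `normal_cdf` is ours.

```python
from math import erf, sqrt

def normal_cdf(z: float) -> float:
    """Standard normal distribution function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

within_3_sigma = normal_cdf(3) - normal_cdf(-3)  # share inside M +/- 3 sigma
band_1_to_2 = normal_cdf(2) - normal_cdf(1)      # from M + sigma to M + 2 sigma
band_2_to_3 = normal_cdf(3) - normal_cdf(2)      # from M + 2 sigma to M + 3 sigma
```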


Imagine that the repetition values K of all n equally possible symbols
are the values taken by the random value K of one symbol in n experiments.
Then, relying on the given properties of the normal distribution law, we
compile a correlation table showing the dependence between the assumed
greatest repetition values and the number of IPS symbols corresponding to
them.


Correlation Table

Greatest value of symbol repetition, R_max    Number of symbols (in fractions of n)

from M(K) + σ(K) to M(K) + 2σ(K)              0.136 (13.59%)
from M(K) + 2σ(K) + 1 to M(K) + 3σ(K)         0.021 (2.14%)

Let us cite an example from an experimental IPS [5].

The parameters of the IPS are as follows: Q = 177 punch cards, n = 139
symbols, and r = 6 symbols.

Using equations (6) and (7), we determine the mathematical expectancy
and average quadratic deviation of the random value of repetition K of any
symbol.


Using the correlation table, we determine that 139 · 0.136 ≈ 19 and
139 · 0.021 ≈ 3 symbols should have a maximum repetition of 12-14 and 15-17
times, respectively.
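The symbol counts can be reproduced directly. In this short sketch the
fractions are those of the correlation table and the variable names are
ours.

```python
n = 139  # symbols in the dictionary of the experimental IPS

# Fractions of n taken from the correlation table, keyed by repetition band.
shares = {
    "M + sigma to M + 2 sigma": 0.136,
    "M + 2 sigma + 1 to M + 3 sigma": 0.021,
}

# Expected number of symbols whose greatest repetition falls in each band.
symbol_counts = {band: round(n * share) for band, share in shares.items()}
```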

By preserving in the IPS [5] the specified proportions between the number
of symbols and their repetition values, we guard the system against an
increase of the established average search noise due to often repeated
symbols⁴.

In practice, symbols and their random totals encountered in documents are
usually not equally probable.

This is particularly true of often repeated symbols.

The proposed mathematical model of the distribution law of symbol
repetition makes it possible to compute the maximum acceptable value of
symbol repetition for any IPS from its basic parameters: the number of punch
cards (documents) Q, the number of symbols n, and the average number of
symbols on a punch card r. Symbols that repeat the greatest number of times
must be excluded from coding in the normal code field of a punch card.

In this way, the distribution of the values of symbol repetition of a real
system is artificially brought closer to the theoretical distribution.
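A minimal sketch of such a selection step is given below, under our
assumption that the maximum acceptable repetition is taken as
M(K) + 3σ(K), the upper end of the range used in the correlation table. The
system parameters, symbol names and counts are invented, and
`symbols_to_exclude` is a hypothetical helper.

```python
from math import sqrt

def max_acceptable_repetition(Q: int, n: int, r: int) -> float:
    """Upper limit M(K) + 3 * sigma(K); the factor 3 is an assumption."""
    p = r / n
    return Q * p + 3 * sqrt(Q * p * (1 - p))

def symbols_to_exclude(repetition: dict, Q: int, n: int, r: int) -> list:
    """Symbols whose observed repetition exceeds the acceptable limit."""
    limit = max_acceptable_repetition(Q, n, r)
    return sorted(s for s, k in repetition.items() if k > limit)

# Invented repetition counts for a system with Q = 1000, n = 200, r = 8.
observed = {"steel": 95, "shaft": 41, "thread": 58, "flange": 37}
excluded = symbols_to_exclude(observed, Q=1000, n=200, r=8)
```

Symbols returned by the helper would be the candidates for entry on a
specially isolated field rather than on the common code field.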


Footnotes


1. To p. 1. Search noise [1] is the proportion of delivered punch cards
(documents) that are unrelated to the given information request.

2. To p. 1. A code symbol is the number of a position or cell of the
punch card; the code configuration (the code) of a symbol is formed from
a random combination of such positions.

3. To p. 3. According to paper [6], non-recurring selection is selection
in which an element, once selected, is removed from the overall total,
so that the selection contains no repeated elements.

4. To p. 8. The maximum symbol repetition obtained for the experimental
IPS of [5] is somewhat inaccurate, because the system contains 177 punch
cards instead of the 220 recommended as the minimum for the normal
approximation of the binomial distribution.
